Efficient dictionary-based text rewriting using subsequential transducers
Abstract
Problems in the area of text and document processing can often be described as text rewriting tasks: given an input text, produce a new text by applying some fixed set of rewriting rules. In its simplest form, a rewriting rule is given by a pair of strings, representing a source string (the “original”) and its substitute. By a rewriting dictionary, we mean a finite list of such pairs; dictionary-based text rewriting means replacing occurrences of originals in an input text by their substitutes. We present an efficient method for constructing, given a rewriting dictionary D, a subsequential transducer T that accepts any text t as input and outputs the rewriting result t′ obtained under the so-called “leftmost-longest match” replacement with skips. The time needed to compute the transducer is linear in the size of the input dictionary. Given the transducer, any text t is rewritten deterministically in time O(|t| + |t′|), so the resulting rewriting mechanism is very efficient. As a second advantage, the transducer can be composed directly with other transducers, using standard tools, to solve more complex rewriting tasks efficiently in a single processing step.
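To make the “leftmost-longest match” replacement with skips concrete, the following is a minimal reference sketch of the rewriting semantics only. It is not the subsequential transducer construction described in the abstract: it scans a plain trie and restarts at every position, so its worst-case running time is O(|t| · m), where m is the length of the longest original, rather than O(|t| + |t′|). The names `build_trie` and `rewrite` are hypothetical and chosen for illustration.

```python
def build_trie(dictionary):
    """Build a trie over the originals.

    Each node is a dict mapping a character to a child node; the
    substitute of a complete original is stored under the key None.
    """
    root = {}
    for original, substitute in dictionary.items():
        if not original:
            raise ValueError("originals must be non-empty")
        node = root
        for ch in original:
            node = node.setdefault(ch, {})
        node[None] = substitute
    return root


def rewrite(text, dictionary):
    """Rewrite text under leftmost-longest match with skips.

    At each position, the longest original starting there is replaced
    by its substitute; if no original starts there, the character is
    copied unchanged (a "skip") and scanning moves one step right.
    """
    trie = build_trie(dictionary)
    out = []
    i = 0
    while i < len(text):
        node = trie
        j = i
        best_end, best_sub = None, None
        # Walk the trie as far as the text allows, remembering the
        # longest original seen so far.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:
                best_end, best_sub = j, node[None]
        if best_end is None:
            out.append(text[i])  # skip: no original starts at i
            i += 1
        else:
            out.append(best_sub)  # replace the longest match
            i = best_end
    return "".join(out)


if __name__ == "__main__":
    rules = {"a": "1", "ab": "2", "bc": "3"}
    print(rewrite("abc", rules))
```

With the sample rules, the input `abc` is rewritten by matching `ab` at position 0 (preferred over the shorter `a`), after which `c` is skipped because no original starts there. A transducer-based approach as described in the abstract avoids the repeated restarts that this sketch performs.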
Similar resources
Failure Transducers and Applications in Knowledge-Based Text Processing
Finite-state devices encoding lexica and related knowledge bases often become very large. A well-known technique for reducing the size of finite-state automata is the use of failure transitions. Here we generalize the concept of failure transitions for finite-state automata to the case of subsequential transducers. Failure transitions in the new sense do not have input but may produce output. A...
On Some Applications of Finite-State Automata Theory to Natural Language
We describe new applications of the theory of automata to natural language processing: the representation of very large scale dictionaries and the indexation of natural language texts. They are based on new algorithms that we introduce and describe in detail. In particular, we give pseudocode for the determinization of string to string transducers, the deterministic union of p-subsequential s...
Automatically deriving categories for translation
An adequate approach to speech translation for small to medium sized tasks is the use of subsequential transducers, a finite-state model, as the language model for a speech recognizer. These transducers can be automatically trained from sample corpora. The use of manually defined categories improves the training of the subsequential transducers when the available data are scarce. These categories d...
Properties of some classes of rational relations
We study theoretical and algorithmic aspects of some classes of rational relations: the finite valued rational relations, the k-valued rational relations, for every positive integer k, the sequential functions, and the subsequential functions. First, we study some classical results concerning the rational relations, the representations of rational relations by transducers and matrices, and s...
Rational Kernels for Arabic Stemming and Text Classification
In this paper, we address the problems of Arabic Text Classification and stemming using Transducers and Rational Kernels. We introduce a new stemming technique based on the use of Arabic patterns (Pattern Based Stemmer). Patterns are modelled using transducers and stemming is done without depending on any dictionary. Using transducers for stemming, documents are transformed into finite state tr...
Journal title:
- Natural Language Engineering
Volume 13
Publication date: 2007